Supercomputer Programming Environments
Vincent A. Guarna, Jr.

Duncan  H.  Lawrie 

June  9,  1987 
Abstract 

The quest to apply an ever-increasing amount of computing power to numerical applications
has resulted in the evolution of a broad spectrum of ideas and implementations for
high performance computing systems. The architectural complexity of these high performance
systems typically requires special tools and techniques to achieve efficient utilization
of available computational resources. These tools range from automatic restructuring
and optimizing compilers to interactive debugging and performance analysis systems. The
programming environment for these systems must be general and adaptive, providing the
appropriate level of assistance for users of varying levels of sophistication. This paper
presents recent developments in supercomputer environments, and focuses in more detail
on the Cedar Project, which is currently under way at the University of Illinois Center for
Supercomputing Research and Development. The Cedar Project consists of the construction
of a prototype multiprocessor, restructuring compilers for the Fortran and C programming
languages, and an integrated graphics-based programming environment
intended to serve the needs of scientific applications users.

Keywords:  scientific  computation,  parallel  computation,  parallel  languages,  vector 
languages,  programming  environments,  optimizing  compilers,  parallel  debuggers 


This paper will appear in the Proceedings of the Symposium on Parallel Computations
and Their Impact on Mechanics, to be held at the ASME Winter Annual meeting in Boston,
December 13-18, 1987.

2  INTRODUCTION 

Present supercomputers1 require vectorization of codes to achieve anywhere near their
performance potential. Additionally, on some machines, vector registers must be carefully
managed to avoid as much memory access as possible lest memory access become a
bottleneck. This must all be done while managing disk I/O, which is vastly slower than
the processor, and I/O from some form of bulk random access memory.

1 In this paper we use the word supercomputer to refer to the fastest general-purpose scientific
computers, e.g., machines like the Cray X-MP, Cray II, CDC Cyber 205, ETA 10, Fujitsu FACOM VP,
Hitachi S-810, and NEC SX. These machines can have at least several processors all sharing a common
memory. Other machines, for example NCUBE, hypercube, etc., have many processors and a distributed
(unshared) memory. These latter machines are striving for "super" status, and indeed some of these can
provide supercomputer performance on certain applications. However, our expertise lies in the former
area and we will not address the latter machines.

The next generation of supercomputers will be even more complex than present
supercomputers. Machines with multiple processors are already in the field. This will be the
primary characteristic of the next generation of supercomputers: multiprocessing2, but
more so. Yet techniques for using multiprocessors are poorly understood today. Not only
do programs need to be rewritten as they were for vector machines to capitalize on
performance gains available from multiprocessors, but often the algorithms themselves must be
redesigned to allow multiprocessing. Optimizing compilers are just beginning to do a
credible job of vectorization. They are a long way from being able to restructure a
program to use multiprocessors effectively, a process we call parallelization. Further, the
design of parallel algorithms is still a relatively new art.

Multiprocessing leads to other difficulties as well. Programming multiple, asynchronous
tasks is probably an order of magnitude more difficult than what most of us are used
to programming: a single execution stream. Once we start using multiple execution
streams, we must be careful about cases where multiple streams access the same data.
Where data access by multiple execution streams might cause a problem, we must use
some form of synchronization. And in machines lacking a shared memory, data must be
explicitly moved from processor to processor (or sometimes, rather than moving the data,
in effect the program is moved from processor to processor).

Debugging parallel programs is also an order of magnitude more difficult. For example,
most errors are not easily reproduced because the exact time ordering of the multiple
execution streams will vary from one run to the next. Thus, errors caused by poor
conceptualization of the synchronous/asynchronous nature of the program are not only the
easiest errors to make, but the most difficult to find and reproduce.

Memories will also increase in complexity. For example, many new supercomputers
will of necessity use cache memories to better match the processor and memory speeds.
Consider, however, the effect of vectorization on cache memories. When a program is
vectorized, the result is usually statements that compute on whole vectors at a time. Often
each vector in such a statement is only accessed once. Yet, if the vector is long (and the
longer the better for performance), then it may have the effect of flushing the cache.
Already, we see several common optimizations at odds. This same phenomenon adversely
affects the locality of programs in machines which have paged memories. But the
complexity does not stop there. New machines are likely to contain some mixture of local
memories (memory accessible by only one processor) and shared memory (memory shared
by some or all of the processors). To get the full potential performance of these machines,
data must be carefully allocated in the best memory depending on its access characteristics.
Often this allocation must change during the execution of the program.

Add to all this the increasing disparity between processor speed and I/O speed, and we
have even greater need for complexity in how I/O is handled. This in turn necessitates
multitracked I/O (streaming data to or from multiple disks simultaneously), disk caches,
and bulk random access memories.

It  is  perhaps  ironic  that  in  this  age  of  the  microprocessor  and  personal  computer,  when 
software  is  finally  becoming  easy,  perhaps  even  fun  to  use,  supercomputers  are  becoming 
more  difficult  to  use.  We  must  see  to  it  that  this  does  not  happen.  Better  programming 
environments  are  needed —  compilers,  languages,  debugging  and  performance  tools— if  we 
are  to  make  use  of  the  tremendous  potential  offered  by  supercomputers. 

2 We define multiprocessing to mean the use of more than one processor on one job. This is sometimes
called multitasking, and the term parallel processing has also come into vogue to mean the same thing. This
is different from vector processing, which means computing on vectors, usually with a pipeline processor like
the Cray series.


3  ISSUES  IN  PARALLEL  PROGRAMMING  LANGUAGES 

Several  approaches  are  possible  in  the  design  and  selection  of  programming  languages 
for  parallel  processing.  In  this  section  we  will  discuss  Fortran  and  its  extensions.  A  few 
remarks  will  be  made  at  the  end  on  alternative  languages. 

Fortran will be emphasized due to its predominance. It is safe to say that most of the
application code for parallel scientific computers is in the form of numerical programs
written in Fortran, and that this situation will continue in the near future. Supercomputers
use either an optimizing compiler or Fortran extensions to exploit both vector and
asynchronous parallelism. We will discuss these two forms of parallelism next, starting
with vector parallelism.

Some vendors use standard sequential Fortran and rely on the compiler to exploit vector
parallelism. These compilers include a vectorization phase where regular do loops are
internally transformed into vector assignment statements. To give the programmer control
over what is vectorized and how, these Fortran compilers all accept some form of
vectorization commands supplied via comment cards. Also, a programming style may be
suggested to the programmer to help the vectorizer.3 The main advantage of using standard
sequential Fortran is portability. Thus, Fortran programs (even if they were not written
for supercomputers) can often be efficiently run on a new supercomputer either without
change or with the addition of a few compiler directives with vectorization commands for
the new machine.

Another possible approach to exploit vector parallelism is to extend Fortran with vector
assignment statements. Four types of constructs have been used for vector assignment
statements. The first construct, control vectors, was used by two early vector languages:
the Burroughs Illiac IV Fortran (Ref. 11), and Glypnir (Ref. 33). The latter language was
also designed for the Illiac IV, and while it was based on Algol, its control structures could
be trivially incorporated into Fortran. Control vectors were boolean vectors used to control
vector operations. Burroughs Illiac IV Fortran used control vectors as array subscripts.
A * denoted a boolean vector with all elements set to true. Thus,

    real A(100), B(100)
    A(*) = B(*) + A(*)                                  (1)

added  corresponding  elements  of  arrays  A  and  B  and  assigned  the  result  to  array  A.  On 
the  other  hand, 

    do 10 i = 1, 100, 2
       M(i) = .true.
       M(i+1) = .false.                                 (2)
 10 continue
    A(M(*)) = B(M(*)) + A(M(*))

did  the  same  thing  but  only  for  the  odd  elements  of  A  and  B. 
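
As an illustration only, the effect of a control vector can be mimicked in a present-day language. The following Python sketch is not Illiac IV Fortran; the helper name `masked_add` and the test data are assumptions introduced here. It applies the addition wherever the boolean mask is true, first with an all-true mask as in (1) and then with the odd-position mask M built in (2).

```python
def masked_add(a, b, mask):
    """Return a new list: a[i] + b[i] where mask[i] is true, a[i] elsewhere."""
    return [x + y if m else x for x, y, m in zip(a, b, mask)]


n = 100
a = [float(i) for i in range(1, n + 1)]   # a[0] plays the role of A(1)
b = [10.0] * n

# A(*) = B(*) + A(*): a control vector with every element true.
all_true = [True] * n
full = masked_add(a, b, all_true)

# Loop (2): M is true at the odd 1-based positions 1, 3, 5, ...
m = [(pos % 2 == 1) for pos in range(1, n + 1)]
odd_only = masked_add(a, b, m)
```

Note that Python lists are 0-based, so the odd Fortran positions correspond to the even Python indices.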

In Glypnir control vectors had 64 elements, one for each Illiac IV processor (PE), which
were used to control whether a processor was to execute or remain idle. Variables in
Glypnir could be declared to be of the pe type; this specified that there would be a copy
of the variable on each processor. To illustrate these ideas consider the following program:

    pe real x, y, z
    for all z < 0 do x = y + 1                          (3)

This program specified that x, y and z were 64 element arrays, and that for 1 ≤ i ≤ 64,
y(i) + 1 was to be assigned to x(i) whenever z(i) < 0. In the for all statement, the ith
element of the control vector had the boolean value 'z(i) < 0'.

3 For example, the Cray CFT manual suggests: "Keep subscripts simple and explicit; do not use
parentheses in subscripts."

The  language  IVTRAN,  also  developed  for  the  Illiac  IV,  introduced  a  second  type  of 
construct:  do  for  all  (Ref.  38).  This  construct  specified  the  subscripts  to  be  used  in 
the  vector  operation.  Thus, 


    do 10 for all i = 1, 100, 2
 10 A(i) = B(i) + A(i)                                  (4)


operated in vector form on the odd elements of A and B as was done by loop (2). The
do for all index was not limited to a single dimension. Thus,

    real A(100, 50)
    do 10 for all i = [1, 100].c.[1, 50]                (5)
 10 A(i) = A(i) + 1

added one to each element of the 100 x 50 array A. (The .c. means Cartesian product,
and i represents a pair of integers <j,k> where j ∈ [1,100] and k ∈ [1,50].) A construct
similar to do for all was present in early versions of the Fortran 8X (Ref. 37)
draft standard but has been removed.

The last three vector constructs we will discuss are the ones presently adopted by the
Fortran 8X proposal. These were originally developed as part of Vectran (Refs. 42 and
43), an extension to Fortran developed at the IBM Houston Scientific Center. The basic
construct is the vector assignment statement based on triplets, three integers separated by
colons that specify beginning subscript, ending subscript, and stride.4 Thus,

    A(1:100:2) = B(1:100:2) + A(1:100:2)                (6)

performs the operation and assignment on the odd elements of A and B and is equivalent
to loops (2) and (4). The triplet notation is complemented with the identify statement
used for the selection of array sections like matrix diagonals (which cannot be expressed
via triplets), and the where statement used to perform conditional vector element
assignments.
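
Readers more familiar with current languages may recognize triplets in Python's slice notation. The sketch below is only an analogy, not Fortran 8X: the helper `triplet` is a hypothetical name introduced here, and it assumes a positive stride. It converts a 1-based, inclusive triplet into a 0-based slice, applies it to the statement (6), and then imitates a where-style conditional assignment with a list comprehension.

```python
def triplet(first, last, stride=1):
    """Convert a 1-based, inclusive first:last:stride triplet to a Python slice."""
    return slice(first - 1, last, stride)


a = list(range(1, 101))      # a[0] plays the role of A(1)
b = [1000] * 100

s = triplet(1, 100, 2)       # the section A(1:100:2)
a[s] = [x + y for x, y in zip(b[s], a[s])]

# A where-style conditional assignment: add 1 only where the element is even.
c = [v + 1 if v % 2 == 0 else v for v in a]
```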

The adoption of Fortran 8X will make use of vector constructs more common. However,
this will not rule out vectorizing compilers as will be discussed below. Both vector
constructs and vectorization will probably coexist as they do today in, for example,
Alliant Fortran (Ref. 5).

A second class of constructs are those used to express asynchronous parallelism. In
what follows, we will discuss Fortran extensions assuming shared memory (see footnote 1).
Extensions to Fortran for systems without shared memory should typically involve just a
few intrinsic routines for message passing and synchronization (see, for example, Ref. 2).

Multitasking constructs are the more traditional ones. Generally there is some kind of
fork, process, or co-begin statement that causes the start of a new execution
stream that can execute in parallel with the original stream. We call this new stream a
process. Note that the number of processes started may far exceed the number of processors
available for parallel execution. However, this only influences performance, since in
multitasking the operating system automatically multiplexes processors and therefore
gives the illusion of the availability of an unlimited number of processors.


4  Stride  is  the  distance  between  successive  array  elements. 
When multitasking is implemented in software,5 there can be a substantial amount of
overhead involved in setting up a new process because storage must be allocated, etc.
Thus, if the granularity of the task6 is small, i.e., if the amount of work to be done by the
process is small, then the overhead of allocating a new process may overwhelm the useful
work done. To get around the high overhead of process allocation, especially when the
tasks are small, microtasking is sometimes used. With microtasking, it is not necessary to
allocate a complete new process for each task. What usually happens is that the number
of processes started is equal to the number of available processors. Then each process will
be assigned one task but no multiplexing will be done. In other words, the task will
remain associated to the process until it is completed, thus saving some of the overhead
associated with process allocation. Remaining tasks are allocated to processes only when a
process becomes available by virtue of having finished a task. This assignment of tasks to
processes is done by the user (or perhaps the Fortran run-time support library).

Microtasking can cause problems, however. In the case of multitasking, each task has
a process. If any process (and thus task) is blocked, then that process is suspended and
the operating system automatically switches the processor to any other ready process. For
example, suppose we have three tasks, α, β, and γ, and three processes a, b, and c.
Further assume that there are only two processors and they start executing processes a
(with task α) and b (task β). If b gets blocked for some reason (for example, waiting for
I/O or waiting for a signal from task γ), then b is suspended, which releases a processor
and allows process c to begin.

Now, suppose we are microtasking. Further suppose that there are two processors
again, that the user has asked for two processes, x and y, and that task α is assigned to
process x, task β is assigned to y, and task γ remains unassigned. Now if β is blocked
waiting for a signal from γ and simultaneously α is blocked waiting for a signal from β,
then we have a condition known as a deadlock: β and α cannot finish because they
need a signal from γ, but γ can never signal because γ cannot start until either α or β
has finished. Thus we have a case where, if we allocated a distinct process to each task,
no deadlock occurs, whereas if we do microtasking and the number of tasks is greater
than the number of processes, we can get a deadlock. Of course, the user (or support
library) can design a more clever microtasking system, but this will likely increase the
overhead and thus defeat the original reason for microtasking.
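
The contrast between the two allocation policies can be made concrete with a small deterministic simulation. The Python sketch below is a toy model rather than any vendor's run-time system. It assumes that a task sends its signal when it finishes, that tasks are bound to processes in list order, and that a bound task is never unbound. The function name `simulate`, the task names, and the particular wait chain are hypothetical choices for illustration.

```python
def simulate(tasks, waits_for, n_processes):
    """Bind tasks to a fixed pool of processes with no multiplexing.

    A task keeps its process until it finishes, and it can finish only after
    every task it waits for has finished (signal-at-completion assumption).
    Returns (finish order, "completed" or "deadlock").
    """
    pending = list(tasks)
    running = []
    finished = []
    while pending or running:
        # Hand free processes to unassigned tasks, in order.
        while pending and len(running) < n_processes:
            running.append(pending.pop(0))
        ready = [t for t in running if waits_for.get(t, set()) <= set(finished)]
        if not ready:
            return finished, "deadlock"   # every bound task is blocked
        task = ready[0]
        running.remove(task)
        finished.append(task)
    return finished, "completed"


tasks = ["alpha", "beta", "gamma"]
waits_for = {"alpha": {"beta"}, "beta": {"gamma"}}

one_process_per_task = simulate(tasks, waits_for, n_processes=3)
two_processes = simulate(tasks, waits_for, n_processes=2)
```

With one process per task the third task runs and releases the other two; with only two processes both are occupied by blocked tasks and the third never starts.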

Microtasking systems allow only restricted types of synchronization, for example, critical
regions and cascade synchronization. The reason for this restriction is to avoid
deadlock situations like the one discussed above. Assume that sections of code in different
streams may be identified with a name. The critical region mechanism guarantees that no
two processors will be inside a critical section with the same name at the same time. In
this case, we say there is mutual exclusion. For cascade synchronization it is assumed that
if a task τ signals another task μ, then a process will be allocated to τ before it is
allocated to μ.

5 Multitasking is almost always implemented in software. The only exception we know of is the
Denelcor HEP (Refs. 18 and ’.’5) where the implementation was in hardware. This made multitasking fast
enough to be used to start parallel loops, and obviated the need for microtasking.

6 In some contexts task is considered a synonym of process. Here the word task will mean an activity to
be performed by a process. This activity may be, for example, to execute a statement, a group of
statements, or one iteration of a loop.

In an attempt to clarify these ideas, we will now discuss loop parallelism. A parallel
loop whose iterations contain no synchronization across iterations (except for critical
sections) is called a doall7 loop. The defining characteristic of doall loops is that their
iterations may be executed in parallel and that processors may be allocated to iterations
in any order. An example of such a loop is the following:

    doall i=1,n
       B(i) = A(i)
       do while (B(i)**2 - A(i) .gt. epsilon)
          B(i) = (B(i) + A(i)/B(i))/2.0                 (7)
       end do
    end doall

Iteration i of this loop computes the square root of A(i) using Newton-Raphson, and
assigns it to B(i) (we assume that A(i) > 1).
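
A rough modern analogue of loop (7) is a worker pool that may pick up iterations in any order. The Python sketch below is offered only as an illustration. The names `newton_sqrt` and `doall_sqrt` are hypothetical, the thread pool stands in for the processors, and the convergence test is scaled by the input (an assumption of this sketch, not part of loop (7)) so that it also terminates for large values in floating point.

```python
from concurrent.futures import ThreadPoolExecutor


def newton_sqrt(a, epsilon=1e-12):
    """Body of one doall iteration: Newton-Raphson square root of a >= 1."""
    b = a
    # The source tests b**2 - a against epsilon; scaling by a (an assumption
    # of this sketch) keeps the test meaningful in floating point for large a.
    while b * b - a > epsilon * a:
        b = (b + a / b) / 2.0
    return b


def doall_sqrt(values, workers=4):
    """Run the independent iterations on a pool of workers, in any order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(newton_sqrt, values))


roots = doall_sqrt([1.0, 4.0, 9.0, 2.0, 1.0e10])
```

Because no iteration reads or writes data belonging to another, the result does not depend on the number of workers or on the order in which they take up iterations.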

Doall loops should not be synchronized in such a way that a certain number of processors
or a certain processor allocation order would be required for correct execution.
For example, the loop:8

    semaphore S(:)
    V(S(1))
    V(T(1))
    doall i=1,n
       P(S(i))
       A(i) = A(i-1) + 1                                (8)
       V(S(i+1))
       P(T(i))
       B(i) = B(i-1) + A(i)
       V(T(i+1))
    end doall

is invalid since iteration i > 1 cannot start execution until iteration i-1 has started, and
this imposes an order on processor allocation. Thus, if only two processors were available
at run time, and they were allocated to iterations 2 and 3, the program would never
complete since iteration 2 cannot start until semaphore S(2) is incremented in iteration 1.

Parallel loops where iterations wait for synchronization signals from previous iterations
happen with some frequency. For these types of loops, the doacross construct can be
used. This construct requires that processors be allocated first to earlier iterations. Thus,
the previous loop with the doall keyword replaced by doacross will be correct.

7 From FMP Fortran (Ref. 13), which used a generalized version of IVTRAN's do for all vector
construct.

8 In this loop, P and V are the well known synchronization operations. These operate on semaphores.
The P(S) operation tests the semaphore S and if its value is greater than zero, it decrements it and
proceeds. If S is zero, the process waits until a V(S) operation is executed. The V(S) operation checks
whether there are processes waiting on semaphore S; if so, it allows one of them to proceed; otherwise,
V(S) increments S by one. A fundamental characteristic of these operations is (Ref. 19):

    P- and V-operations are "indivisible actions"; i.e. if they occur "simultaneously" in parallel processes
    they are noninterfering in the sense that they can be regarded as being performed one after the other.

Another example of doacross is obtained by transforming the loop:


    do i=1,N
       do j=1,M
          U(i,j) = U(i-1,j) + U(i,j) + U(i+1,j) + U(i,j-1)   (9)
       end do
    end do

into  the  following  parallel  equivalent: 

    semaphore S(:,:)
    doacross i=1,N
       do j=1,M
          if (i.ne.1) P(S(i,j))
          U(i,j) = U(i-1,j) + U(i,j) + U(i+1,j) + U(i,j-1)   (10)
          V(S(i+1,j))
       end do
    end doacross

Doall loops can have synchronization instructions in their bodies as long as they do
not require a particular allocation order or a minimum number of processors. This will be
the case when the synchronization instructions are those used to create critical sections.
For example, the loop:9

    do i=1,N
       A(K(i)) = A(K(i)) + 1                            (11)
    end do

is  equivalent  to  the  loop: 

    semaphore S(:)
    doall i=1,N
       P(S(K(i)))                                       (12)
       A(K(i)) = A(K(i)) + 1
       V(S(K(i)))
    end doall
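
The critical-section form of this loop maps naturally onto one lock per array element. The following Python sketch is illustrative only. The name `doall_increment` and the sample index vector are hypothetical, and a lock's acquire and release stand in for P and V on a semaphore, which is assumed here to start out free.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def doall_increment(a, k, workers=4):
    """Add one to a[k[i]] for every i, with one lock per element of a."""
    locks = [threading.Lock() for _ in a]     # plays the role of S(:)

    def body(i):
        idx = k[i]
        with locks[idx]:                      # P(S(K(i))) on entry
            a[idx] = a[idx] + 1
        # leaving the with-block releases the lock: V(S(K(i)))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(body, range(len(k))))   # list() surfaces any exception
    return a


k = [0, 1, 1, 3, 1, 0] * 100
counts = doall_increment([0, 0, 0, 0], k)
```

Iterations that touch different elements proceed independently, while iterations that share an element are serialized, in any order, by that element's lock.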

Besides loop parallelism, microtasking has also been used for straight line parallelism
when the execution time of each segment is relatively short.

In Table 1, a summary of the main features of several parallel Fortran dialects is
presented. Fortran remains predominant as the supercomputer programming language.
However, there is no lack of advocates for other languages. Foremost among the contenders
are the functional languages: these include FP (Ref. 7), ID (Ref. 39), VAL (Ref. 1),
SISAL (Ref. 30), and ParAlfl (Ref. 27). In functional programming there is no global
state being modified by the program, but only functions mapping input values onto output
values. Some have claimed that these languages are more appropriate for parallel
processing due to the lack of side effects. However, as far as we know, there is today no
high-quality implementation of any of these languages that can successfully compete with
Fortran in the generation of efficient object code for supercomputers.

9 Notice that in this loop, if several iterations operate on the same element of A, as happens when
several elements of K(i) are equal, the order in which the iterations are done is not important since each
iteration simply adds one to an element of A. However, it is important that no two iterations operating on
the same element of A be done in parallel or the effect of some of these iterations will be lost. This
sequentialization is taken care of by the next loop.


Language            Document    Date  Vector Assignment          Vectorizing  Multitasking        Microtasking
                                      Statements                 Compiler
------------------  ----------  ----  -------------------------  -----------  ------------------  ----------------------------------------------------------
Illiac IV FORTRAN   (Ref. 11)   1971  Control Vectors            NO           NO                  NO
Glypnir             (Ref. 33)   1972  Yes                        NO           NO                  ?
IVTRAN              (Ref. 38)   1973  DO FOR ALL                 YES          NO                  NO
Vectran             (Ref. 42)   1975  Triplets, IDENTIFY, WHERE  NO           NO                  NO
BSP FORTRAN         (Ref. 12)   1978  Triplets, IDENTIFY, WHERE  YES          NO                  NO
HEP FORTRAN         (Ref. 20)   1978  NO                         NO           Hardware Supported  NO
FMP FORTRAN         (Ref. 13)   1979  NO                         NO           NO                  Hardware Supported, No Synchronization
FORTRAN 8X          (Ref. 37)   1987  Triplets, IDENTIFY, WHERE  N/A          NO                  NO
Fujitsu FORTRAN     (Ref. 20)   1985  NO                         YES          NO                  NO
Convex FORTRAN                  1985  NO                         YES          NO                  NO
IBM VS FORTRAN      (Ref. 45)   1985  NO                         YES          NO                  NO
Cray CFT            (Ref. Ill)  1980  NO                         YES          Cray Primitives     Software Supported, Critical Regions
Alliant FORTRAN     (Ref. 5)    1985  Triplets                   YES          UNIX Primitives     Hardware Supported, Cascade Synchronization (Implicit)
Sequent FORTRAN     (Ref. 10)   1985  NO                         ?            UNIX Primitives     Software Supported, Critical Regions, Cascade Synchronization
EPEX/FORTRAN        (Ref. 50)   1985  NO                         ?            ?                   Software Supported, Barrier Synchronization

Table 1  Characteristics of Fortran Implementations

An  area  not  widely  explored  so  far  is  that  of  parallel  symbolic  computing.  We  believe 
that  much  more  will  be  done  in  this  area  in  the  future.  For  this  type  of  programming, 
parallel  versions  of  Lisp  (Refs.  21,  21  and  >19),  and  Prolog  (Refs.  15  and  17)  are  being 
developed.  Also,  some  compiler  techniques  to  parallelize  Lisp  programs  have  been 
developed  (Refs.  25  and  26). 

4  STATE-OF-THE-ART  PROGRAMMING  TOOLS
Restructuring  and  Interactive  Restructurers 

As  we  discussed  above,  many  supercomputers  include  software  for  automatically 
extracting  parallelism  from  what  was  originally  sequential  code.  We  will  start  this  section 
by  presenting  several  examples  of  parallelization.  One  of  the  simplest  transformations  is 
the  one  that  can  be  performed  when  all  iterations  are  independent  of  each  other.  The 
way  to  determine  whether  the  loop  iterations  are  independent  is  by  computing  a  data 
dependence  graph.  We  will  not  define  this  graph  here;  more  information  on  dependence 
graphs  and  program  parallelization  may  be  found  in  Refs.  (30)  and  ( 11).  In  this  paper  we 
will  limit  ourselves  to  a  few  examples  of  transformations. 

All iterations in the following loop are independent:

    do i=1,n
       if (A(i) .gt. 0) then
          B(i) = C(i) + D(i)
          E(i) = F(i)                                   (13)
       end if
    end do

and  therefore  it  can  be  transformed  to: 

    where (A(1:n) .gt. 0)
       B(1:n) = C(1:n) + D(1:n)
       E(1:n) = F(1:n)                                  (14)
    end where

or to:

    doall i=1,n
       if (A(i) .gt. 0) then
          B(i) = C(i) + D(i)
          E(i) = F(i)                                   (15)
       end if
    end doall

or to:

    doall i=1,n,K
       m = min(i+K-1,n)
       where (A(i:m) .gt. 0)
          B(i:m) = C(i:m) + D(i:m)
          E(i:m) = F(i:m)                               (16)
       end where
    end doall

Parallelizing loops is possible even when loop iterations are not independent;
parallelizing compilers could transform do loops into doacross loops by inserting the
appropriate synchronization instructions. For example, a parallelizing compiler could
transform do loop (9) to doacross loop (10).

Sometimes secondary transformations are necessary before a loop can be parallelized.
For example, the loop:

    do i=1,n
       A = B(i) + C(i)
       D(i) = A + 1                                     (17)
    end do

cannot  be  parallelized  because  all  iterations  use  the  variable  A  to  store  intermediate 
values. 

A transformation called scalar expansion will make the iterations of the previous loop
independent by changing A into an array:

    do i=1,n
       AX(i) = B(i) + C(i)
       D(i) = AX(i) + 1                                 (18)
    end do
    A = AX(n)
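
To see that scalar expansion preserves the result, one can compare the two loops directly. The Python sketch below is illustrative only. The function names `loop_17` and `loop_18` and the sample data are hypothetical, and a thread pool stands in for the parallel execution that becomes legal once the shared scalar is gone.

```python
from concurrent.futures import ThreadPoolExecutor


def loop_17(b, c):
    """Sequential loop (17): the scalar a is reused by every iteration."""
    d = [0] * len(b)
    a = None
    for i in range(len(b)):
        a = b[i] + c[i]
        d[i] = a + 1
    return d, a


def loop_18(b, c, workers=4):
    """Scalar-expanded loop: a becomes ax, so iterations share no scalar."""
    n = len(b)
    ax = [0] * n
    d = [0] * n

    def body(i):
        ax[i] = b[i] + c[i]
        d[i] = ax[i] + 1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(body, range(n)))
    a = ax[n - 1] if n else None     # A = AX(n): restore the final scalar value
    return d, a


b = [1, 2, 3, 4, 5]
c = [10, 20, 30, 40, 50]
```

The final assignment from the last element of the expanded array matters only if the scalar is used after the loop; it reproduces the value the sequential loop would have left behind.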

Another important secondary transformation is loop interchanging. This transformation
makes it possible to map either of the following two doubly-nested loops into the
other:


    do i=1,n
       do j=1,n
          A(i,j) = A(i,j-1) + 1
       end do
    end do


    do j=1,n
       do i=1,n
          A(i,j) = A(i,j-1) + 1
       end do
    end do


If  the  input  loop  is  the  first  one  above,  and  the  target  machine  is  a  vector  machine,  the 
compiler  will  first  transform  the  first  loop  into  the  second  and  then  vectorize  the  inner 
loop. 

On the other hand, if the input loop is the second one and the target machine is a
multiprocessor, the compiler will first transform the second loop into the first loop
and then transform the outer loop into a doall loop. This is done so that the overhead
involved in starting the doall loop is paid only once.
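The legality of this interchange can be illustrated with a short Python sketch. It assumes the loop body is a recurrence along j of the form A(i,j) = A(i,j-1) + 1, stores column 0 as the starting values, and uses hypothetical function names. Because the dependence runs only along j, either loop order produces the same array; what changes is which loop carries the dependence.

```python
def i_outer(A):
    """i outer, j inner: the inner loop carries the recurrence, while the
    outer iterations are independent (a candidate for doall)."""
    A = [row[:] for row in A]
    for i in range(len(A)):
        for j in range(1, len(A[i])):
            A[i][j] = A[i][j - 1] + 1
    return A


def j_outer(A):
    """j outer, i inner: the inner iterations are independent, so the
    inner loop is a candidate for vectorization."""
    A = [row[:] for row in A]
    columns = len(A[0]) if A else 0
    for j in range(1, columns):
        for i in range(len(A)):
            A[i][j] = A[i][j - 1] + 1
    return A
```

Both functions return the same matrix for the same input, so a compiler is free to pick the order that suits a vector machine or a multiprocessor.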

The final secondary transformation we will discuss is blocking. This transformation is
used mainly for memory management. For example, assume a vector machine where all
arithmetic machine instructions are register-to-register, and that the vector registers
are 32 words long. The loop:


do i=1,n
    A(i) = B(i) * C(i)
end do


can  be  transformed  into: 


do i=1,n,32
    do j=i,min(i+31,n)
        A(j) = B(j) * C(j)
    end do
end do


The inner loop can be vectorized as follows:

do i=1,n,32
    m = min(i+31,n)
    A(i:m) = B(i:m) * C(i:m)
end do


The vector operations can now be mapped into vector register instructions as shown
next (vr1, vr2 and vr3 are 32-element vector registers):

do i=1,n,32
    m = min(i+31,n)
    vr1 = B(i:m)
    vr2 = C(i:m)
    vr3 = vr1 * vr2
    A(i:m) = vr3
end do

Let us now discuss a more complex example. Assume a multiprocessor with a cache
memory on each processor. Further assume that the cache (and thus the memory) is
divided into blocks of K words each, and that data is only exchanged between memory
and cache as whole blocks. Assume also that matrix columns are sequences of blocks (i.e.,
matrices are stored in column-major order and columns are much bigger than blocks).

Consider  now  the  loop: 

do i=1,N
    do j=1,N
        B(j,i) = A(i,j) + 1                (25)
    end do
end do

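A naive compiler might transform the outer loop into a doall without any other
transformations, causing (1+1/K) block transfers between memory and caches for each
assignment executed: every reference to A(i,j) in the inner loop touches a different
block, while K consecutive references to B(j,i) share one block.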

To improve this situation, the compiler could block both loops into groups of K
iterations. This would have the effect of transposing the matrix by KxK submatrices or
blocks, hence the name loop blocking. After blocking and interchanging loops, we end up
with the loop:

do io=1,N,K
    do jo=1,N,K
        do i=io,io+K-1
            do j=jo,jo+K-1
                B(j,i) = A(i,j) + 1        (26)
            end do
        end do
    end do
end do

If  the  outer  loop  is  now  transformed  into  a  doall  loop,  the  number  of  cache  block 
transfers  decreases  to  2/K,  a  clear  improvement  over  the  naive  approach  when  K  is  not 
small. 
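The two transfer counts can be reproduced with a small simulation. The Python sketch below rests on assumptions that are not in the text: a single cache managed with least-recently-used replacement, a capacity of 2K blocks, N a multiple of K, serial execution of the loops, and a count of block loads only. The class and function names are hypothetical.

```python
from collections import OrderedDict


class BlockCache:
    """Tiny LRU cache of memory blocks that only counts block transfers."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.transfers = 0

    def touch(self, array, row, col, K):
        # Column-major storage: a block is K consecutive rows of one column.
        key = (array, col, row // K)
        if key in self.blocks:
            self.blocks.move_to_end(key)
            return
        self.transfers += 1
        self.blocks[key] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict least recently used


def naive_transfers(N, K):
    """Loop (25): i outer, j inner, no blocking."""
    cache = BlockCache(capacity=2 * K)
    for i in range(N):
        for j in range(N):
            cache.touch("A", i, j, K)   # A(i,j): moves along a row
            cache.touch("B", j, i, K)   # B(j,i): moves down a column
    return cache.transfers


def blocked_transfers(N, K):
    """Loop (26): both loops blocked into K x K tiles."""
    cache = BlockCache(capacity=2 * K)
    for io in range(0, N, K):
        for jo in range(0, N, K):
            for i in range(io, io + K):
                for j in range(jo, jo + K):
                    cache.touch("A", i, j, K)
                    cache.touch("B", j, i, K)
    return cache.transfers
```

With N = 16 and K = 4 there are 256 assignments. Under these assumptions the naive order performs 320 block transfers (1 + 1/K per assignment) and the blocked order performs 128 (2/K per assignment).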

To conclude the work on this loop we need to block once more for vector registers,
vectorize the innermost loops, and map into vector register instructions:10

doall io=1,N,K
    do jo=1,N,K
        do i=io,io+K-1
            do j=jo,jo+K-1,32
                m = min(j+31,n)
                vr1 = A(i,j:m)
                vr2 = vr1 + 1              (27)
                B(j:m,i) = vr2
            end do
        end do
    end do
end do

Even though for presentation purposes the previous examples were shown as
source-to-source transformations, parallelization is most often performed inside the
compiler, and usually the programmer is informed of the transformations only via annotated
program listings. The annotations, when used in conjunction with information on the
execution time of program segments (known as a program profile), identify those segments of
code that the programmer, in his quest for speedup, should rewrite or expand with
parallelization directives to the compiler. Two examples from Alliant Fortran are the
commands cvd$g cncall and cvd$g nosync. The first one allows loops to be parallelized
in the presence of subroutine calls.11 The second command allows parallelization of
loops even if several iterations assign values to the same memory location.

10 Some transformations like vectorization actually make the code easier to read. On the
other hand, loop blocking is an example of a transformation that the user would prefer not
to see: it certainly does not make the code any easier to read even though it makes it
more efficient.

The existence of these annotations indicates that parallelization is different from
traditional compiler optimizations. A compiler performs register allocation and usually
common subexpression elimination, but it never informs the user of how successful it was
in applying these transformations. The major difference between regular optimizing
compilers and vectorizing/parallelizing compilers is that the benefits of vectorization are
potentially higher, and either program rewriting or parallelization commands may be
necessary to obtain efficient parallelism. An example of such a situation is provided by
the transformation that takes a Fortran do loop and transforms it into a vector
assignment statement. One piece of information required for this transformation is an
analysis of the array subscripts inside the do loop. This analysis can be performed when
the subscripts are linear functions of the loop indices. When this analysis cannot be
performed, vectorization is precluded. For example, the loop:

do i=1,100
    A(K(i)) = A(K(i)) + 1.1                (28)
end do

cannot be vectorized12 if nothing is known about vector K. One way to get this
information is via assertions. In this case, the compiler will vectorize only if the
programmer asserts that K(i) ≠ K(j) whenever i ≠ j.
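The need for this assertion can be seen in a short Python sketch (illustrative only; the function names are hypothetical and indices are 0-based). The vector form gathers all operands before any result is stored, so it matches the serial loop only when K contains no repeated subscript.

```python
def sequential_update(A, K):
    """Loop (28) executed one iteration at a time."""
    A = list(A)
    for i in range(len(K)):
        A[K[i]] = A[K[i]] + 1.1
    return A


def vector_update(A, K):
    """Vector form: gather all operands, add, then scatter all results."""
    A = list(A)
    gathered = [A[k] for k in K]             # gather A(K(1:n))
    results = [x + 1.1 for x in gathered]    # one vector add
    for k, r in zip(K, results):             # scatter back into A
        A[k] = r
    return A


def has_no_repeats(K):
    """The property the programmer asserts: K(i) differs from K(j) for i != j."""
    return len(set(K)) == len(K)
```

With `K = [2, 0, 3]` both functions agree. With `K = [1, 1]` the serial loop adds 1.1 twice to the same element while the vector form adds it once, which is exactly the case the assertion rules out.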

Depending on the target machine, the annotations provided to the programmer may
vary in complexity. In some cases it may be appropriate to show the programmer a
source code version of the parallelized program. This approach is followed by those
parallelizers that perform source-to-source translation such as Parafrase (Ref. 31), KAP
(Refs. 17 and 18), VAST (Ref. 9), and PFC (Ref. 3). The output of these restructurers
includes some form of parallel constructs. Since no standards exist for such constructs,
there is no uniformity, even though some form of the vector extensions from Fortran 8x is
frequently adopted.

The  input  parallelization  commands  can  also  be  replaced  by  parallel  constructs.  Thus, 
instead  of  specifying  that  a  loop  may  be  vectorized,  the  programmer  may  write  a  vector 
assignment  statement.  However,  portability  suffers  when  parallel  extensions  are  used 
since  there  is  no  standard  for  those  extensions. 

In the past, vectorizing compilers have been considered only in the context of dusty
decks13. We believe that restructurers (source-to-source translators) are also useful in
other contexts. Specifically, a restructurer could be used to free the programmer (at least
to some extent) from performance considerations and let him concentrate on correctness.
We could, therefore, conceive of programming parallel computers as a two-step process.
First, a program would be written and tested. Once the programmer was convinced of its
correctness, the program would be transformed into an efficient version through
automatic means. Clearly, things will not always work out in this way since achieving
efficiency might involve changing the program in ways beyond the capability of current
restructurers. However, we believe that this two-step process could often be applied.

11 When compilers are faced with subroutine calls inside of loops, they usually assume the
worst: that the loop cannot be vectorized or multiprocessed. Recent work (Refs. 10, 14, 28,
and 52) suggests that some interprocedural analysis can be done to permit transformations
in some cases. But user assertions are still going to be important.

12 By vectorizing we mean transforming loop (28) into the single statement

    A(K(1:100)) = A(K(1:100)) + 1.1

We should point out that even if nothing is known about K we could transform (28) into a
sequence of vector statements, where the first ones do some sort of runtime dependence
testing.

13 The term dusty deck refers to old (dusty) programs that need to be compiled without
human rewriting, either because the cost of manpower needed for the rewriting is
prohibitive, or because nobody understands the programs any more. Some people feel that the
only use for sophisticated restructuring compilers is for processing dusty decks, and that
new languages allowing the explicit expression of parallelism will obviate the need for
these compilers. One look at the results of loop blocking above should convince anyone that
even if programmers can use explicit parallelism, there are some optimizations better left
to the compilers.

In the process of restructuring, interaction with the user may be needed for one of
the following reasons:

•  A certain transformation is valid only if the user supplies an assertion. The
vectorization of loop (28) is an example of this situation.

•  The restructurer needs information from the user to decide how to transform a
construct. For example, a vector assignment where only some of the array elements are
to be assigned may be transformed into a sequence of vector operations including
gather/scatter operations or into vector operations that mask some of the
assignments. Knowing the density of the array elements to be assigned is necessary in this
case (see Ref. 20).

•  The restructurer may not have a model of the target machine, and therefore it will
be unable to decide by itself what transformations to apply even if the program
behavior is known at compile time.

Most of the restructurers today are batch restructurers. Some of them interact with
the programmer (also in batch mode) by requesting information in the listing and
accepting comment cards with information supplied by the programmer. However, interactive
restructurers have clear advantages, and are beginning to emerge.

Debuggers 

Much has been done in the area of symbolic debuggers over the last ten years. Many
user-friendly tools such as Berkeley Unix's dbx (Ref. 53) and Apollo's debug (Ref. 6)
have done an excellent job of providing an environment that allows controlled probing
and analysis of application programs. Debuggers of this type typically support a set of
tools that can be used to determine the state of an object program at any point in its
execution. These tools include the facilities for setting breakpoints, monitoring, examining,
and tracing variables (which can be symbolically referenced), and single-stepping the
target program. This set of functions is usually sufficient to allow the user to determine the
unique state of a single-processor application within the context of the original source
program. Supercomputer applications, however, present a more difficult problem to the
debugging programmer. These programs use vector and/or multitasking parallelism in
order to achieve their high computation speed. Parallelism in programs introduces new
wrinkles that make it difficult to extend serial debugging tools to the parallel domain.
Additionally, as mentioned earlier, the optimizing compiler tools that usually accompany
the supercomputer hardware do extensive restructuring of the original source program,
widening the gap between the user's perception of the application and the run-time
representation.

Vector parallelism by itself is not inherently complicated. Languages that include
vector constructs exist today, and compilers for these languages can map high-level vector
constructs directly into vector instructions for the pipelined vector architectures.
Programs that explicitly use only vector parallelism can be analyzed with traditional
debuggers having minimal extensions, since these programs still execute through a single
stream of statements at one time. The problem arises when serial programs are passed
through vectorizing restructurers. These optimizers can significantly change the
appearance of the original source code, making it difficult for a user to reconcile the
edit-time and run-time states of his program. For this reason, the reporting of run-time errors
may not make sense when returned to the user, forcing him to recompile his program
without optimization and rerun the application in order to help isolate problems.

Multitasking applications present a more serious challenge to debugger technology (as
well as ideology). In the parallel execution domain, the concept of a breakpoint is not
clear. In the parallel environment the notions of global and local effects must be
considered. A local effect is one that is administered to a small group (possibly one) of
processors, whereas a global effect is one that is administered to all processors related to the
execution of a particular program. A breakpoint applied to one portion of code executing
on one processor may or may not be expected to halt all other processors executing the
same code.14 Furthermore, should a global effect for that breakpoint be desirable, the
question of granularity must be resolved, namely, how quickly after the breakpoint is
reached can all of the other processors be stopped? A similar problem exists with respect
to the variable name space. Tracing and examining specific variable names becomes more
tedious when several processors are executing, each with its own copy of the same routine
(and hence the same list of local variables). Tracing of such variables must be qualified
with additional information that identifies the processor or group of processors of interest.

The most challenging aspect of parallel debugging is the timing conflicts introduced by
interacting, independently running processors. The series of states through which a serial
program passes is not time dependent and is therefore repeatable, providing the
opportunity for an unlimited number of reruns in order to localize run-time anomalies. The set of
states through which a parallel program passes is dynamic and very sensitive to the speed
at which each processor is progressing. For this reason, program errors may surface
infrequently. Furthermore, these timing or synchronization errors might be completely
masked when software debugging instrumentation is inserted into the code (thus changing
the run-time image).
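The sensitivity to relative processor speed can be made concrete with a small Python sketch that enumerates every possible interleaving of two processors performing an unsynchronized increment of a shared counter. This is an illustration under simple assumptions (two processors, each doing one load and one store, with hypothetical function names), not a model of any particular machine.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one interleaving of unsynchronized 'counter = counter + 1'."""
    counter = 0
    local = {}
    for proc, step in schedule:
        if step == "load":
            local[proc] = counter
        else:                      # "store"
            counter = local[proc] + 1
    return counter


def possible_results():
    p0 = [(0, "load"), (0, "store")]
    p1 = [(1, "load"), (1, "store")]
    return sorted({run(s) for s in interleavings(p0, p1)})
```

Of the six interleavings, some leave the counter at 2 and others at 1, so the same program can produce different answers from run to run depending only on timing.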

Another debugging problem is the nature of supercomputer application programs.
These programs tend to manipulate large quantities of single- and double-precision
floating-point numbers to perform their tasks. Finding errors in output listings from
lengthy computations can be user-intensive and time-consuming. New methods of
rendering this information in the form of graphic images must be used to present large volumes
of information in a concise manner.

Many solutions to the parallel debugging problem are starting to appear in industry
and academia. One such solution is Pdbx developed by Sequent Computer Systems,
Incorporated (Ref. 46). Pdbx is an enhanced version of dbx that supports debugging of
multiple process applications on Sequent's shared memory multiprocessor machine. In
addition to the functionality of dbx, Pdbx supports the debugging of multiple Unix
processes. Supported are such features as breakpoints for one or more processes,
independent examination and tracing of individual processes, and the use of multiple terminals or
"windows" for monitoring multiple processes. While Pdbx provides no facilities to
control the repeatability of a parallel program, it does extend the functions of a traditional
serial symbolic debugger to provide some tools for probing the execution of parallel
programs.

Instant Replay, developed at the University of Rochester, is another debugging
environment, targeted at helping users debug parallel programs on the BBN Butterfly
(Ref. 31). Instant Replay attacks the repeatability problem by regulating and recording
access to shared data objects. By introducing some small run-time overhead (variable,
but as low as one to ten percent in some applications), Instant Replay attaches aging
information to all shared objects and records revision numbers as these objects are
updated and disseminated. In addition to recording this revision information, the
run-time system has the ability to "replay" the application while ensuring the same access
sequences to shared objects. This gives the programmer the capability to perform the
cyclic rerunning necessary to do incremental debugging on a parallel machine.

14 CSRD is currently researching the question of whether global effects are necessary or
desirable when debugging parallel programs.
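The idea of recording revision numbers and then forcing the same access sequence on replay can be sketched in a few lines of Python. This is a toy model written for this discussion and not the Instant Replay implementation: there is a single shared object, processes are lists of "read" and "write" operations (a write stores the last value read plus one), and the reader count kept with each write is an assumption of the sketch that keeps a write ordered after the reads that preceded it.

```python
class SharedObject:
    def __init__(self, value):
        self.value = value
        self.version = 0      # bumped on every write
        self.readers = 0      # reads seen since the last write


def step(obj, local, pid, op):
    """Run one operation of process pid and return its log entry."""
    if op == "read":
        entry = ("read", obj.version)
        local[pid] = obj.value
        obj.readers += 1
    else:                     # "write": store the last value read, plus one
        entry = ("write", obj.version, obj.readers)
        obj.value = local[pid] + 1
        obj.version += 1
        obj.readers = 0
    return entry


def record(programs, schedule):
    """Run under a given timing (list of process ids) and log versions seen."""
    obj, local = SharedObject(0), {}
    pc = [0] * len(programs)
    log = [[] for _ in programs]
    for pid in schedule:
        log[pid].append(step(obj, local, pid, programs[pid][pc[pid]]))
        pc[pid] += 1
    return obj.value, log


def may_proceed(obj, entry):
    if entry[0] == "read":
        return obj.version == entry[1]
    return obj.version == entry[1] and obj.readers == entry[2]


def replay(programs, log, preference):
    """Re-run with a different scheduler preference; the log decides who may go."""
    obj, local = SharedObject(0), {}
    pc = [0] * len(programs)
    remaining = sum(len(p) for p in programs)
    while remaining:
        for pid in preference:
            if pc[pid] < len(programs[pid]) and may_proceed(obj, log[pid][pc[pid]]):
                step(obj, local, pid, programs[pid][pc[pid]])
                pc[pid] += 1
                remaining -= 1
                break
        else:
            raise RuntimeError("replay is stuck; log does not match program")
    return obj.value
```

Two processes that each read and then write finish with 1 under the schedule `[0, 1, 0, 1]` and with 2 under `[0, 0, 1, 1]`. Replaying either log reproduces the recorded result whichever process the replay scheduler happens to favor.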

The future will probably bring developments in several areas in response to the
increased challenge of constructing and debugging parallel programs. Hardware
enhancements represent one necessity. New supercomputer designs will continue to incorporate
an increasing amount of instrumentation to support run-time monitoring and control.
This special-purpose hardware is necessary to monitor the execution history of parallel
programs in a non-intrusive way. Additionally, low-level hardware support can be used to
effect a parallel breakpoint mechanism that cannot be cleanly achieved in software.

Other tools to be expected in the future include pre-compilation and post-run analysis
tools. Designers of parallel software will be able to use these tools to analyze
interprocessor communication and flag coding sequences that are potentially time dependent.
Such an analysis system is being developed at the University of Illinois at
Urbana-Champaign (Ref. 4). The system consists of a diagnostic compiler that can warn the user
of source code sequences that contain guaranteed race conditions as well as potential race
conditions that cannot be determined with certainty at compile time. To help ascertain
the behavior of indeterminate race conditions, the compiler will automatically insert the
appropriate instrumentation into the object code to investigate data reference behavior.
Trace data generated by this instrumentation can then be inspected (both manually and
automatically) to help determine the status of potential timing hazards recognized before
execution. Systems such as this serve not only to assist users in locating nondeterministic
code sequences in parallel applications, but also to help users improve program
performance. By allowing the compiler to instrument an application in areas of uncertain race
conditions, the user can potentially learn of assertions he could add to his code that would
eliminate conservative assumptions that might normally be made by the restructuring
compiler.

Improvements in development environments will also help to ease the burden of using
supercomputers effectively. Knowledge-based systems could play a major role in the
implementation of an error-free program. By guiding the user through the selection from
a library of optimized numerical kernels, expert systems can aid the user in locating
reusable, bug-free code. This is one way the environment can help to reduce the potential
number of software errors during development.

Another way is through the use of new knowledge-based debugging tools. An
intelligent parallel debugging system could work in conjunction with the system compilers and
run-time monitoring system to ascertain the required information for a processor-time
graph (Figure 1). Once an error is recognized in the program, the programmer could
invoke a debugging expert that would ask a series of questions relating to the error to
help the user isolate the problem. In the case of an intermittent synchronization error,
the expert system could reference the processor-time graph and supply a mapping of
execution time to source code lines to aid the user in looking for programming errors. In the
case of Figure 1, the programmer would be directed to review the code that executes
between times t^ to t( and times t. to t9 since maximum parallelism occurs during these
times and is therefore likely to be the cause of intermittent synchronization problems.

Finally, continuing improvements in restructuring compiler technology will reduce the
pressure on individuals to find and implement parallelism in application programs. While
parallel programming must be encouraged in order to build applications that achieve the
highest levels of performance, automatic optimization systems will allow more casual users
to develop and debug an application in a serial, reproducible environment, leaving the
correct parallelization transformations to the compiler.

Figure 1  Sample processor-time graph (axis label: time)


Performance  Evaluation 

Performance evaluation on any machine has traditionally been a difficult and much
neglected task. Supercomputers worsen the problem with a complicated arsenal of
hardware and software additions that require new levels of understanding by the user.
Many of the architectural features used by supercomputer designers to gain
computational speed are the same features that make program performance difficult to track.
Pipelined vector arithmetic units, shared caches, interconnection networks, and shared
memories are just a few aspects of supercomputer systems that can be responsible for a
wide variance of execution times for a particular application, depending on their
operational efficiency. These architectural components are complex and interrelated, resulting
in a run-time environment that is difficult to trace and analyze.

State-of-the-art performance evaluation tools for existing single-processor machines are
helpful when extended into the parallel domain, but remain too simplistic to do a
complete job. These utilities help programmers understand the run-time resources consumed
by an application at a high level, without providing insight into the machine-level
interaction that might be involved. The Unix operating system supports performance analysis
for programs written in Fortran, C, and Pascal through a set of profiling tools and system
timer calls. One of the profiling tools is the gprof (Refs. oil, 22, 23) utility, which allows
users to accumulate call counts and execution times on a subroutine basis by sampling the
program counter of the running application at regular intervals. This information is
collected through the use of special run-time code inserted into the object module on demand
by the user at compile time. The profiling technique is very useful for isolating specific
routines that account for major portions of an application's execution time, but does little
to help the user improve efficiency and utilization once the trouble areas have been
located.

For example, the following code fragment:

do I = 1, 100
    do J = 1, 100
        do K = 1, 100
            A(I,J,K) = 0                   (29)
        end do
    end do
end do

is an example of a loop that will generate an excessive number of page faults when
executed in a virtual memory system. The reason for the performance problem is that the
array is referenced in the "wrong" order. With Fortran arrays, element A(1,1,1) is
adjacent in memory to element A(2,1,1), not A(1,1,2). For that reason, each
write to A(I,J,K) may require a disk access instead of a simple memory write. In this
example, knowing that the routine which contains this loop is responsible for a significant
percentage of the execution time for a particular application is useful but not sufficient.
Inexperienced programmers who have not seen this effect may be unable to correct
performance deficiencies of this sort. The analysis given by the run-time system should include
more detailed information about the nature of the delays incurred in the designated
routine along with suggestions about whether the observed performance is "reasonable."
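The effect of loop order on paging can be estimated with a small simulation. The Python sketch below is a model under stated assumptions and not a measurement: the array is n by n by n words in column-major order, pages hold `page_words` words, physical memory holds `frames` pages with least-recently-used replacement, and the function name is hypothetical.

```python
from collections import OrderedDict


def page_faults(n, order, page_words, frames):
    """Count page faults for A(I,J,K) = 0 over an n*n*n column-major array.

    order lists the loop variables from outermost to innermost, e.g. "IJK".
    """
    resident = OrderedDict()      # LRU set of resident pages
    faults = 0
    idx = {}
    for a in range(n):
        for b in range(n):
            for c in range(n):
                idx[order[0]], idx[order[1]], idx[order[2]] = a, b, c
                # Column-major: the first subscript varies fastest in memory.
                address = idx["I"] + n * idx["J"] + n * n * idx["K"]
                page = address // page_words
                if page in resident:
                    resident.move_to_end(page)
                else:
                    faults += 1
                    resident[page] = True
                    if len(resident) > frames:
                        resident.popitem(last=False)
    return faults
```

For a scaled-down case with n = 16, 64-word pages and 4 page frames, the order "IJK" (K innermost, as in the fragment above) faults on every one of the 4096 writes, while "KJI" (I innermost) faults only 64 times, once per page.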

While the above example represents a simple problem that can also be seen in
uniprocessors, other examples for parallel machines can be more subtle. Consider, for example,
an application that applies a set of n processors using a shared cache to do a computation
on n individual, independent arrays. Depending on the size of the arrays, the access
pattern of each processor, and the cache algorithm used by the memory system, the speedup
seen for this application could be anywhere from n down to numbers less than 1 (see
footnote 15). Speedups of less than 1 can result from each of the n processors overwriting
cache blocks that have just been loaded by one of the other processors. If perfectly timed,
each processor's memory reference would result in a cache miss, thus nullifying the
usefulness of the cache (see footnote 16). While the situation described is a pathological
case, some interference effects do exist in the caching system that can impact performance.
Additionally, other aspects of the machine operation such as contention in the
interconnection network also impact the program's performance. Complete analysis of the
performance of applications in the presence of these effects requires the ability to capture
this information.

As was the case with debugging, restructuring compilers complicate the effort of tuning
supercomputer programs. Again, the user is faced with the problem of reconciling
run-time trace and performance information with source code listings that do not completely
match. More recent optimization techniques such as subroutine expansion (Refs. 28, 35)
pose some of the more difficult problems, since many performance tools tend to collect
data at the subroutine level and this transformation erases that modularity. Subroutine
expansion enlarges the granularity of the monitored program, giving the user less detailed
information about the nature of the run-time performance.

Another aspect of computing in general (and supercomputing in particular) that is
troublesome for performance evaluation is multiprogramming. Since investments in
supercomputer hardware can be quite large, there is a strong incentive to achieve
maximum utilization of available machine time. Utilization of supercomputers is usually
enhanced through multiprogramming. This causes problems for the performance analysis
system by complicating the hardware and software instrumentation. Multiprogrammed
systems require enhanced performance analysis instrumentation in order to do the
necessary accounting for multiple jobs. For job swaps and operating system calls, more
context information must be saved to ensure the integrity of individual job accounting.
Hardware and software probes must be turned off and on during context switches instead
of running continuously as in a monoprogrammed environment.

15 Speedup is T1 / Tn, where T1 is the time required to execute the application serially and
Tn is the time required to execute the application using n processors. Speedups of less than
1 indicate that the application runs slower in the parallel environment than it would have
serially.

16 In fact, since the caching mechanism introduces some overhead, the resulting application
might run slower than it would with a single processor system with no cache.

Multiprogramming also causes problems for the user attempting to improve an
application's execution performance. Applications running in monoprogrammed
environments are easily evaluated by examining the apparent execution time, or
"wall clock" time. Users optimizing applications in such an environment need only make
changes and execute again, comparing the new wall clock time to that of previous runs.
While wall clock time is a useful metric to gauge the performance of applications in these
monoprogrammed environments, it is not useful in understanding the performance of an
application in a multiprogrammed system. System loading, scheduling algorithms,
resource availability, and other factors uncontrollable by the user all contribute to the
apparent execution time of an application. The user will find that the wall clock times of
two successive runs of an identical program can vary greatly.
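The distinction between wall clock time and the time actually charged to a job can be illustrated with Python's standard `time` module. This is a sketch for a modern time-shared system and not something taken from the environments discussed here; `timed` and `work` are hypothetical helper names.

```python
import time


def timed(fn, *args):
    """Run fn once and report (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real ("wall clock") time
    cpu_start = time.process_time()    # CPU time charged to this process
    result = fn(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


def work(n):
    """A small compute-bound kernel to be timed."""
    return sum(i * i for i in range(n))
```

On a loaded machine the wall clock figure grows with the amount of time the job spends waiting for other jobs, while the CPU figure tends to stay closer to the work the job itself performed, which is why comparing the two over several runs gives a rough indication of how much of the variation comes from system load.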

Future programming environments will include many enhancements to the
hardware/software configurations that are offered by supercomputer vendors if
application programs are to perform efficiently on their machines. First, hardware enhancements
will be necessary to achieve many of the performance evaluation goals currently
envisioned. Some manufacturers have already realized the need for including such
specialized hardware as part of their standard machine configurations. Cray Research, Inc.
manufactures a machine called the X-MP which includes a hardware performance monitor
that comprises a set of counters that can monitor certain hardware-related events (Ref.
32). These counters track events such as floating-point operations, instruction fetches,
I/O and CPU memory references, and vector operations on a per-CPU basis.17 As
supercomputing experience accumulates, the heightened awareness of the need for performance
information will drive other manufacturers to provide a basic set of hardware
instrumentation that can be used for performance and correctness tracking.

Perhaps the greatest potential for improvement in the area of performance evaluation
is that of presentation techniques. While data capturing facilities in system hardware and
software are evolving, innovations in the area of data rendering are slow in coming. The
volume of run-time trace data that could be of interest in a parallel execution
environment is too massive to represent in raw form. Of particular interest are areas of
interaction between multiple processors that are synchronizing in some way. A detailed analysis
of this environment could require an extensive log of time-stamped accesses. Such a log is
usually not practical to review, possibly consisting of several hundred pages of entries.
More interesting techniques involving concise graphic representations must be developed
to make this information more usable.
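
As a rough illustration of condensing such a log, the following sketch reduces a list of time-stamped synchronization events to one total per processor and renders the totals as text bars. The event format (timestamp, processor number, and the kinds "wait_begin" and "wait_end") is an assumption made for this example and does not describe the trace format of any particular machine.

```python
from collections import defaultdict


def wait_time_per_processor(events):
    """Sum the time each processor spends waiting at synchronization points.

    events: iterable of (timestamp, processor_id, kind), where kind is
    "wait_begin" or "wait_end" (a hypothetical trace format).
    Returns {processor_id: total_wait_time}.
    """
    started = {}
    totals = defaultdict(float)
    for timestamp, cpu, kind in sorted(events):
        if kind == "wait_begin":
            started[cpu] = timestamp
        elif kind == "wait_end" and cpu in started:
            totals[cpu] += timestamp - started.pop(cpu)
    return dict(totals)


def render_bars(totals, width=40):
    """Return one text bar per processor, scaled to the longest wait."""
    longest = max(totals.values(), default=0) or 1
    lines = []
    for cpu in sorted(totals):
        bar = "#" * round(width * totals[cpu] / longest)
        lines.append(f"cpu {cpu}: {bar} {totals[cpu]:g}")
    return lines


if __name__ == "__main__":
    log = [
        (0.0, 0, "wait_begin"), (2.0, 0, "wait_end"),
        (1.0, 1, "wait_begin"), (5.0, 1, "wait_end"),
        (6.0, 0, "wait_begin"), (7.0, 0, "wait_end"),
    ]
    print("\n".join(render_bars(wait_time_per_processor(log))))
```

Even this crude summary replaces a page of log entries with one line per processor; a graphical environment could present the same totals as a chart.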

Innovations in the area of pre-compilation and pre-execution performance analysis
tools can also be expected. These tools might take several forms to aid the user at
different times during the program development cycle. One such tool might be a program
analyzer that could evaluate the use of system library routines and present estimates of
the run-time performance of an application based on past execution statistics of the
library kernels and the extent of their usage. Other tools might look at generated
assembly code and make predictions about execution speed based on the density of vector
computation opcodes versus that of scalar, control, and other “glue” opcodes.
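
A tool of the second kind might be sketched as follows. The sketch counts the fraction of instructions in an assembly listing that are vector opcodes and turns that fraction into an Amdahl-style estimate. The mnemonics, the comment character, and the assumption that every instruction costs the same in scalar mode are all hypothetical simplifications and are not taken from any real instruction set.

```python
def vector_density(listing, vector_opcodes):
    """Fraction of instructions whose mnemonic is in vector_opcodes.

    Blank lines and lines starting with ';' (assumed comment marker)
    are skipped. The first token on a line is taken as the mnemonic.
    """
    total = vector = 0
    for line in listing.splitlines():
        text = line.strip()
        if not text or text.startswith(";"):
            continue
        mnemonic = text.split()[0].upper()
        total += 1
        if mnemonic in vector_opcodes:
            vector += 1
    return vector / total if total else 0.0


def amdahl_estimate(density, vector_gain):
    """Estimated overall gain if the vector fraction runs vector_gain
    times faster and all instructions are assumed to cost the same."""
    return 1.0 / ((1.0 - density) + density / vector_gain)


if __name__ == "__main__":
    listing = """
    VLOAD  v0, a
    VADD   v2, v0, v1
    VSTORE v2, c
    ; loop control ("glue")
    ADD    r1, r1, 1
    """
    density = vector_density(listing, {"VLOAD", "VADD", "VSTORE"})
    print(density, amdahl_estimate(density, 3.0))
```

A production tool would need per-opcode timings and knowledge of memory behavior; the point here is only that opcode density is cheap to compute before the program is ever run.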


17 The Cray X-MP is available in multiple CPU configurations.


As with debugging, the problem of performance evaluation can lessen as compiler
technology continues to improve. The problem of tuning application programs should
gradually become a higher level concern than it is today. Compilers will continue to find new
ways of exploiting parallelism at low levels, while applications programmers are freed to
concentrate on higher level algorithm design issues. Knowledge gained by performance
analysts today will be incorporated in tomorrow’s compilers. As the effects of
caching and memory interconnection networks are better understood, heuristics for better
transformations can be built into optimizing compilers. These compilers should also be
adaptive: able to generate efficient code for many different vendors’ machines in a
particular class, dependent only on a list of important parameters such as cache size,
vector register lengths, number of processors, and others. Additionally, these compilers might
be able to use data collected by performance monitoring systems in order to further refine
compile-time optimizations. The current static decisions about optimizations might be
dynamic in the future, based on information about the eventual run-time environment
(such as system loading). In this environment, the compiler system will migrate toward
an expert system model, soliciting information from the user and statistics databases in
order to provide optimal execution for a wide range of applications.

Integration 

With the number and complexity of software development tools increasing, an
integrated environment is becoming increasingly necessary in the high speed computing
arena. A graphics-based scientific programming environment with an integrated
software productivity tool kit has many things to offer to the supercomputer programmer
that cannot easily be offered by the conventional software development tools being used
today.

First, the programming environment should provide a consistent user interface
paradigm across the entire range of supported tools. Casual or infrequent users are not likely
to spend extensive sessions with user manuals learning the idiosyncrasies of several tools.
Rather, users will become frustrated with the system’s complexity and will resort to
functioning at the easiest possible level, thereby minimizing their effort (and possibly their
efficiency and productivity).

A  well-structured  environment  should  consist  of  a  single  interface  style  through  which 
all  packages  are  accessed.  The  screen  images  seen  by  the  user  during  program  editing 
should  be  the  same  as  the  images  seen  during  debugging  and  program  optimization.  Once 
the  user  is  fluent  with  one  aspect  of  the  system,  several  functions  of  the  system  should  be 
usable with minimal additional effort. The efficacy of such an approach can be seen in
the ease of use and popularity of extant integrated window-based systems such as the
Apple Macintosh.

Additionally, the user interface should support a graphical as well as a textual
representation for programs. Indeed, source text will always be viewable and editable by
the programmer, but higher level abstractions such as static subroutine call graphs and
task/process graphs are useful in understanding overall program structure more readily.
Just as graphic images can be used to render a concise representation of large volumes of
output data, graphic program structures can be a useful tool for programmers wishing
to elide much of the source-level implementation detail in favor of perusing a terse
representation of an application’s architecture.

The Center for Supercomputing Research and Development at the University of Illinois
is currently developing an environment that supports such a programming model. The
environment, named Faust, is targeted at integrating several software development tools
through a common window-based interface. For example, a user wanting to develop an
application at the source-code level may bring up a textual window and enter Fortran
source using a conventional text editor (Figure 2). If the user would rather see the
application at a higher level, an “unzoom” function can be invoked to bring up the
corresponding subroutine interconnection graph (Figure 3). Faust can automatically
create the subroutine call graph (if it does not already exist) from source code; however, if
desired, the user may do the original editing at the graphic level of abstraction and
associate source code with each “block” as the implementation proceeds. Faust also supports
other levels of detail including process graphs that represent parallelism as well as
data-dependence graphs for aiding interactive restructuring.


      PROGRAM MAIN
      CALL A
      DO I = 1, 30
         CALL B
      END DO
      CALL C
      END


Figure 2  Simple application being examined at the source code level

Figure 3  Same application at the subroutine interconnect level
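
How a subroutine call graph can be derived from source text may be sketched as follows. This is not Faust's implementation, which is not described here; it is a minimal illustration that scans Fortran source for PROGRAM, SUBROUTINE, and CALL statements at the start of a line. Labeled statements, continuation lines, function references, and calls embedded in IF statements are deliberately ignored.

```python
import re

# A program unit header and a CALL statement, each at the start of a line.
UNIT_RE = re.compile(r"^\s*(PROGRAM|SUBROUTINE)\s+(\w+)", re.IGNORECASE)
CALL_RE = re.compile(r"^\s*CALL\s+(\w+)", re.IGNORECASE)


def call_graph(source):
    """Map each program unit name to the routines it calls, in first-seen order."""
    graph = {}
    current = None
    for line in source.splitlines():
        unit = UNIT_RE.match(line)
        if unit:
            current = unit.group(2).upper()
            graph.setdefault(current, [])
            continue
        call = CALL_RE.match(line)
        if call and current is not None:
            callee = call.group(1).upper()
            if callee not in graph[current]:
                graph[current].append(callee)
    return graph


if __name__ == "__main__":
    figure_2 = """
      PROGRAM MAIN
      CALL A
      DO I = 1, 30
         CALL B
      END DO
      CALL C
      END
    """
    print(call_graph(figure_2))
```

Applied to the program of Figure 2, the sketch records that MAIN calls A, B, and C, which is the information a subroutine interconnection graph displays. Note that the graph is static: it records that B is called, not that it is called thirty times.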

The concept of detail hiding is the same rationale used to justify high-level languages.
Programmers usually want to concentrate on a solution to an application problem in more
abstract terms than is possible through the use of assembly language. For this reason,
high-level languages create the “language environment” in which the programmer works.
Hiding the details, however, is not always desirable. In some operations, the programmer
may require the detailed information to correct deficiencies in his application (especially
with respect to debugging and performance improvement). To support this facility,
several Unix-based compilers include an option that allows a user to see the assembly
code generated from his source files. This gives the user the ability to work at the higher
level abstraction during normal use as well as to analyze machine level details when
necessary. The superenvironment builds on this idea, providing more levels of abstraction that
are completely controllable by the user. While this issue of multiple levels of detail is not
specific to the programming environments of supercomputers, the additional complexity
of multiple streams of execution makes the abstraction even more desirable.

The scientific programmer’s environment needs to be flexible in a number of different
ways. First, the environment must support users of varying levels of expertise.
Engineering-oriented users are likely to want to concentrate on solving applications
problems, not performance problems. The environment should support this group of users
with an array of automatic restructuring tools and electronic experts that can shoulder
most of the burden of achieving execution efficiency while the user focuses on his
algorithms and application. Numerical analysts and systems programmers, however, will
expect the environment to provide more fundamental tools for scrutinizing the more
subtle aspects of machine operation in order to retrieve the low level data they need to fine
tune system libraries. Somehow the environment must support both ends of the spectrum
in a unified manner.

Second, the programming environment should be able to support multiple types of
machines. Many styles of new machines will be developed over time and the life of an
application program will span several machine architectures. Additionally, many users
support applications on multiple machines at the same time, frequently moving
applications between them as required. For these reasons, the environment should be adaptable
enough to support a production application on several vendors’ architectures without
requiring significant effort from the user. This can be achieved by designing an extensible
interface through which remote utilities (such as vendor-specific optimizing compilers) can
be attached while maintaining the same dialogue and appearance to the user. The local
utilities should also be designed to work according to heuristics developed for general
architectural characteristics rather than machine-specific idiosyncrasies (although, as
mentioned above, numerical analysts will want to take advantage of machine-specific
phenomena when building heavily used kernels). For example, a restructuring compiler
that does transformations for a generic vector processor with vector register length n can
be useful for a number of different machines just by supplying the appropriate n for the
machine of interest.
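
One transformation that depends on the machine only through n is strip mining, in which a long loop is cut into strips no longer than the vector register. The sketch below is an illustration under that assumption, not a description of any particular restructuring compiler. The function names are hypothetical, and a Python list slice stands in for a single vector instruction.

```python
def strip_mine(trip_count, vector_length):
    """Split a loop of trip_count iterations into (start, length) strips,
    each no longer than vector_length."""
    if vector_length < 1:
        raise ValueError("vector_length must be at least 1")
    return [(start, min(vector_length, trip_count - start))
            for start in range(0, trip_count, vector_length)]


def vector_add(a, b, vector_length):
    """Elementwise a + b, processed one strip at a time.

    Each strip models one vector operation on registers that hold
    vector_length elements; only that parameter is machine dependent.
    """
    c = [0.0] * len(a)
    for start, length in strip_mine(len(a), vector_length):
        stop = start + length
        c[start:stop] = [x + y for x, y in zip(a[start:stop], b[start:stop])]
    return c


if __name__ == "__main__":
    # The same source loop retargeted to two hypothetical register lengths.
    print(strip_mine(130, 64))
    print(strip_mine(130, 32))
```

Retargeting the transformation to another machine in the same class then amounts to changing one argument, which is the kind of parameterization the paragraph above describes.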

Finally, the environment should support a wide variety of language domains. While
Fortran is certainly necessary, languages such as C, Ada, Pascal, Val, and others must
also be considered. Future developments are also likely to include languages of a more
symbolic nature. Interactive environments built on systems such as Maxima (Ref. 44) and
Reduce (Ref. 51) could provide a very useful function, offering a higher level of
communication to scientific users who would prefer to express problems in a representation that
more closely resembles mathematical notation than procedural source code. Ultimately,
programming environments will evolve to transcend the mundane details of traditional
programming, allowing scientists and engineers to converse in a language more familiar to
them while the environment fills the gap between the symbolic representations and the
encoding required by the underlying hardware.

The scientific programming environment is well suited to the workstation hardware
offered by manufacturers such as Sun, DEC, Apollo and others. These nodes consist of
bitmapped graphics screens attached to a 32-bit microprocessor running Unix. While
running the environment’s “front end” software on another host (the workstation) poses
some technical problems with respect to implementation of control and communication
links, this configuration offers the ability to run screen-intensive user interaction support
functions locally, which provides several benefits. First, the user need not use expensive
supercomputer hardware to service functions such as text editing and graphics
management. Second, keeping much of the functionality within the workstation helps promote
the desired goal of supercomputer vendor independence. This common front end also
promotes familiarity across systems through the enforcement of a common user interface
between the application programmer and the supercomputer. Finally, having local
intelligence in the workstation gives the user more consistent response from day to day.
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